Get all chunk references for a given file #1185
Conversation
// testRuns[i] and expectedLens[i] are dataSize and expected length respectively
testRuns := []int{1024, 8192, 16000, 30000, 1000000}
expectedLens := []int{1, 3, 5, 9, 248}
I am only aware of one bug in the pyramidchunker, which occurs - as far as I remember, although this was a while ago - when you have all batches in a tree filled plus one chunk. Neither 247 nor 248 matches the count of such a configuration.
Meanwhile, for 1000000 bytes 248 should be correct:
ceil(1000000/4096) = 245
ceil(245/128) = 2
245 data chunks + 2 pointer chunks + 1 root chunk = 248
If you can reproduce this anomaly using the same data (randomized but from a fixed seed), then perhaps we could discover which chunk goes missing, and if it's the same one.
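The arithmetic above can be sketched as a small standalone program. This is not the actual pyramidchunker code, just a counting sketch assuming the standard Swarm parameters (4096-byte chunks, 128 references per intermediate chunk); the function and constant names are made up for illustration.

```go
package main

import "fmt"

const (
	chunkSize = 4096 // bytes of data per chunk (assumed Swarm default)
	branches  = 128  // references per intermediate chunk (assumed Swarm default)
)

// totalChunks counts data chunks plus pointer chunks plus the root chunk
// for a file of dataSize bytes, mirroring the ceil(...) calculation above.
func totalChunks(dataSize int) int {
	if dataSize <= chunkSize {
		return 1 // a single chunk is its own root
	}
	count := 0
	level := (dataSize + chunkSize - 1) / chunkSize // number of data chunks
	for level > 1 {
		count += level
		level = (level + branches - 1) / branches // pointer chunks one level up
	}
	return count + 1 // plus the root chunk
}

func main() {
	for _, size := range []int{1024, 8192, 16000, 30000, 1000000} {
		fmt.Println(size, totalChunks(size))
	}
}
```

For 1000000 bytes this yields 245 + 2 + 1 = 248, matching the expected length in the test table above.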
If this is indeed flaky, alternating between 247 and 248, we must find what is going on...
I'm not sure if I understand this PR. What is the use-case for it?
	return nil, err
}
// collect all references
for _, ref := range putter.References {
Is it intentional that the references are returned in arbitrary order?
@nolash is right - we should sort them before returning them.
And what sort order should this be? By value? By hierarchy? If the latter, how to achieve that when the putter will get the hashes in no specific order?
Alphabetical - so that when we have a tool to query N nodes whether they have a list of chunks, they are all sorted in the same order and we can quickly merge and check N such lists.
OK, I will add alphabetical ascending sort order
Yes, alphabetical order would be the best.
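The agreed-upon sort can be sketched with the standard library. Since references are byte slices, ordering them by byte value is identical to ordering their hex encodings alphabetically; the `Reference` type here is a stand-in for the real storage type.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// Reference stands in for the swarm storage reference type (a byte slice).
type Reference []byte

// sortReferences orders references ascending by byte value, which matches
// alphabetical order of their hex string representations.
func sortReferences(refs []Reference) {
	sort.Slice(refs, func(i, j int) bool {
		return bytes.Compare(refs[i], refs[j]) < 0
	})
}

func main() {
	refs := []Reference{{0xab, 0xc1}, {0x01, 0x23}, {0x9f, 0x00}}
	sortReferences(refs)
	for _, r := range refs {
		fmt.Printf("%x\n", r)
	}
}
```

A deterministic order makes it possible to merge-compare the lists returned by N nodes in a single linear pass, which is the use case described above.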
// HashExplorer's Put will add just the chunk hashes to its `References`
func (he *HashExplorer) Put(ctx context.Context, chunkData ChunkData) (Reference, error) {
	// Need to do the actual Put, which returns the references
	ref, err := he.hasherStore.Put(ctx, chunkData)
Oh my, it's pretty clumsy having to hash everything twice... I wonder why the pyramidsplitter only returns the data, not the reference.
@nolash not everything is hashed twice - and the two requests happen at separate points in time.
We use Store when we actually store a data structure on swarm. There we use the "conventional" PyramidSplit, and having all references for the data structure is (currently) not needed.
We use GetAllReferences for debugging, when we ask a node "do you actually (still?) have a chunk with hash abc123 in your store?" Then the data is hashed again (but only by the checking node) to get all references for the given data structure, so it happens at another point in time, and optionally for user-selected data structures.
Is the use case clear now?
Reopened upstream as ethereum/go-ethereum#19002
This PR adds an endpoint to FileStore which allows getting the list of chunk hashes for a given file.